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Abstract 

We demonstrate a method for the prediction of chemotherapeutic response in patients using only before-treatment 
baseline tumor gene expression data. First, we fitted models for whole-genome gene expression against drug sensitivity 
in a large panel of cell lines, using a method that allows every gene to influence the prediction. Following data 
homogenization and filtering, these models were applied to baseline expression levels from primary tumor biopsies, 
yielding an in vivo drug sensitivity prediction. We validated this approach in three independent clinical trial datasets, 
and obtained predictions equally good, or better than, gene signatures derived directly from clinical data. 



Background 

Identifying and applying molecular biomarkers to pre- 
dict response to medication is particularly important for 
drugs with a narrow therapeutic index, for example che- 
motherapeutic agents, because response is highly vari- 
able and side effects are potentially lethal [1,2]. Many 
studies have been conducted with this objective but only 
a handful of markers can reproducibly predict chemo- 
therapeutic response in the clinic [3]. It is anticipated 
that the number of biomarkers discovered will rise as 
high-throughput sequencing becomes cheaper and more 
pervasive [3,4]; however, the effect size of these markers 
is generally small, since drug response is typically a com- 
plex trait, usually influenced by many genomic and en- 
vironmental factors [3,4]. Thus, it has been hypothesized 
that methods that consider the cumulative effect of many 
markers, may predict complex phenotypes (like drug re- 
sponse) more accurately. Consequently, some researchers 
have recently developed sophisticated methods that in- 
corporate all of the data in a genome. For example, there 
has been some success in using whole-genome SNP or se- 
quence data to predict complex traits [5,6]. 

In cancer, genomic aberrations and aneuploidy are 
common, which means that it is difficult to obtain reli- 
able SNP or genome sequences directly from tumors [7] . 

* Correspondence: rhuang@medicine.bsd.uchicago.edu 

'Section of Hematology/Oncology, Department of Medicine, University of 

Chicago, Chicago, IL 60637, USA 

Full list of author information is available at the end of the article 



However, the quantification of whole-genome gene ex- 
pression levels from primary tumor biopsies is straight- 
forward and has been successfully applied for many 
years [8,9]. Unfortunately, prediction using gene expres- 
sion microarray data has traditionally been fraught with 
reproducibility issues [10]. One of the major concerns is 
that gene expression estimates, generated on different 
microarray platforms or even in different batches, are 
not always consistent [11]. Several analytical approaches 
have recently been suggested to address this problem 
and a large-scale comparison has found that some of 
these methods reliably correct for these biases [12]. Also, 
multiple studies have compared the performance of vari- 
ous algorithms for predicting survival phenotypes from 
microarray expression data [13,14]. These have found 
that ridge regression (a type of regularized linear regres- 
sion that can include the expression of all genes in the 
model) performed best, or was consistently amongst the 
best performing methods. However, gaps remain in the 
utility of these tools in predicting clinical phenotypes. 

Here, we present an approach that integrates several 
of these recently developed computational and statistical 
tools to predict in vivo drug response, using models 
trained on cell line data (see Materials and methods). 
For model development, the approach was applied to re- 
cently released data from the Cancer Genome Project 
(CGP) [15], consisting of baseline (i.e. before drug treat- 
ment) gene expression microarray data and sensitivity to 
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138 drugs in a panel of almost 700 cell lines. Our results 
demonstrate that by building a statistical model from 
these data, it is possible to capture a significant propor- 
tion of the variability in drug response in patients. The 
Cancer Cell Line Encyclopedia [16] has an additional large 
panel of cell lines, for which it is possible to construct 
such models, although here, we focus on the CGP, because 
those cell lines have been screened against more drugs. 

To test our approach, we identified clinical trial data- 
sets that had assessed tumor gene expression before 
drug treatment (using expression microarrays) and had 
subsequently measured a clear drug response phenotype. 
Using these data, we can test whether our models de- 
rived from cell lines capture a significant proportion of 
the variability in drug response in patients. The clinical 
datasets must fulfill the following criteria. Firstly, the 
clinical trial data (both baseline tumor expression and 
post-treatment drug response) must be publicly available 
and easily accessible to allow other researchers to repro- 
duce the results. Patients must have been treated with 
monotherapy, rather than a combination of drugs, as 
multi-drug regimes would clearly confound the results. 
The data must have been published and not retracted. A 
reasonable number of clinically evaluable samples (>20) 
are required for statistical power. Finally, sensitivity to 
the particular drug (as measured by the concentration 
required for 50% of cellular growth inhibition (IC50)), 
must have been quantified in the CGP cell lines, because 
we cannot create suitable models otherwise. To our 
knowledge, there are four existing datasets that fulfill 
these criteria [17-20] and the results of our analysis of 
these data are presented below. Interestingly, the four 
trials were for three different types of cancers treated 
with either cytotoxic or targeted agents. 

Results 

Our goal was to use baseline gene expression and in vitro 
drug sensitivity derived from cell lines, coupled with 
in vivo baseline tumor gene expression, to predict patients' 
response to drugs. An overview of our approach is shown 
in Figure 1 (complete details are in Materials and 
methods; see Data availability for details of how to acquire 
the R code). Ridge regression models, which allow a small 
contribution from every gene, have previously been shown 
to be the best method for predicting survival from gene 
expression microarray data [13,14]. Our analyses are con- 
sistent with these previously published findings. In prelim- 
inarily tests, we assessed several of the plethora of 
available machine learning algorithms, including random 
forests [21], PAM [22], principal component regression 
[23], Lasso [24] and ElasticNet regression [25]. Among 
them, ridge regression was consistently the best performer, 
with the added advantage of being highly computationally 
efficient, which is crucial for cross-validation analysis. 



Furthermore, principal component analysis (PCA) demon- 
strates that whole-genome gene expression can capture 
far more information about cancer biology, than may have 
been previously appreciated. As illustrated in Figure 2 and 
Additional file 1: Figures SI and S2, whole-genome gene 
expression recapitulates tissue of origin, cancer subtype 
and various genomic aberrations, when the CGP cell lines 
are plotted on the first two principal components of a 
whole-genome gene expression matrix. This suggests that 
whole-genome gene expression acts as a surrogate for un- 
measured genetic and non-genetic phenotypes, providing 
additional support for this approach. 

Docetaxel and cisplatin treatment of breast cancer 

We first applied our method to gene expression micro- 
array data obtained from 24 breast cancer tumor biopsies 
through a clinical trial, which measured the response of 
patients to docetaxel neoadjuvant treatment [18]. Tumor 
size, measured before and after four cycles of docetaxel, 
was used to calculate the percentage of residual disease. 
The authors designated individuals as 'sensitive' or 'resist- 
ant' to docetaxel, depending on whether there was <25% 
or >25% of the tumor remaining. Tumor gene expression 
levels were measured from biopsies using Affymetrix 
microarrays (GEO accession number [GEO:GSE6434]). 
In the original study, receiver operating characteristic 
(ROC) curve [28,29] analysis reported an area under the 
curve (AUC) of 0.96 (from leave-one-out cross-validation 
(LOOCV)) using a 92-gene signature derived from the 24 
samples. However, given that this signature was generated 
on the same data on which it was evaluated, this is likely 
to represent an inflated estimate of the classification 
accuracy. 

To compare our method to these results, we used the 
CGP cell lines to build a ridge regression model, which 
related whole-genome gene expression to docetaxel 
sensitivity. We applied the model to the in vivo pretreat- 
ment breast cancer tumor expression data. The pre- 
dicted drug sensitivity value was lower in the patients 
who were defined (by the trial) to be sensitive to doce- 
taxel, compared to the patients defined as resistant 
(Figure 3a; P = 4.0 x 10"^ from f-test). Of the seven indi- 
viduals who were predicted to be most sensitive, six are 
in the trial-defined sensitive group. ROC curve analysis 
revealed an AUC of 0.81 (Figure 3b; P = 5.0 x 10"^). Not- 
ably, training (cell lines) and test (clinical trial) data were 
assessed using different microarray platforms and the 
training set contained only 24 breast cancer cell lines. 
Interestingly, when the models were trained on these 24 
breast cancer cell lines alone, there was no difference in 
predicted drug sensitivity between the trial-defined sen- 
sitive/resistant groups (P = 0.65 from a i-test). This sug- 
gests that the non-breast cancer cell lines included in 
the full training panel are informative for predicting the 



Geeleher et al. Genome Biology 2014, 15:R47 
http://genomebiology.com/201 4/1 5/3/R47 



Page 3 of 1 2 



Drug 

Data 
for 
Cell 
Lines 



Raw 
Microarray 

Data 
[Cell lines] 



Raw 
Microarray 
Data 

[Clinical Trial] 



^ - Sui 



I 



Preprocess Data 

- Background correct, 
mmarize using BrainArray CDF. 

- Quantile normalize. 
Do this separately for 

both datasets. 



Baseline Gene 
Expression Levels 
[Cell lines] 



1 



"A r 



Baseline Gene 
Expression Levels 
[Clinical Trial] 




2. Combine Datasets 

Subset the datasets to common genes. 
Integrate datasets using ComBat. 
Perform feature selection. 



I 



Homogenized Expression Data 



Cell line 
Expression Data 



Clinical Trial 
Expression Data 



3. Fit Ridge 
Regression 
Model 



Ridge 
Regression 
Model 



Predict 
^ / in vivo \ 
^\ Drug J 
V Sensitivity 



Figure 1 Data flow diagram sliowing our approach to predicting in vivo drug sensitivity. Data are represented by rectangles and 
processes by ovals. The input data (baseline expression and drug IC50 in cell lines and in vivo tumor gene expression) are shown in a gray 
box. The raw microarray data are (1) preprocessed separately using the robust multi-array average [25] method and the CDF files remapped 
by BrainArray [27] are summarized, (2) then combined and homogenized using ComBat. (3) A ridge regression model is fitted for baseline gene 
expression levels in the cell lines against the in vitro drug IC50 estimates and (4) this model is then applied to the baseline tumor expression 
data from the clinical trial, to yield drug sensitivity estimates. Complete details are given in Materials and methods. CDF, chip definition file. 



in vivo drug response for breast cancer. For comparison, 
ElasticNet and Lasso regression models were also ap- 
plied to this data, but both underperformed when com- 
pared to ridge regression {P = 0.01 from t-tests for both 
models; Additional file 1: Table SI; see Materials and 
methods for details). 

Next, we applied our method to a second breast can- 
cer dataset, which assessed the response of 24 triple- 
negative patients to neoadjuvant cisplatin therapy [20]. 
We downloaded the raw data from ArrayExpress (acces- 
sion number E-GEOD-18864) and processed it as de- 
scribed in Materials and methods. The authors assigned 
patients to one of four drug response categories based 
on RECIST [30] criteria. This time, our models did not 
capture variability in clinical response (Additional file 1: 



Figure S3; P = 0.26 from a linear regression model). 
LOOCV (see Materials and methods) indicated that, for 
the cell line panel, our models captured approximately 
the same proportion of variability in cellular response to 
cisplatin as they had for docetaxel (r = 0.35, P = 2.6 x 10" 
for docetaxel and r = 0.32, P=lAx 10"^^ for cisplatin 
from Pearson's correlation test between LOOCV esti- 
mated log IC50 and measured log IC50 values). Thus, it is 
surprising that we could not also predict the in vivo re- 
sponse to cisplatin. Notably, the authors of the original 
trial could not generate a gene signature from their data, 
or show that any signature in the literature captured cis- 
platin response in vivo. Furthermore, they found that no 
genes were significantly correlated with response, fol- 
lowing correction for multiple testing [20]. Therefore, it 
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Figure 2 Whole-genome gene expression data capture molecular phenotypes in cancer cell lines, (a) Clustering of cancer types on 
principal component (PC) 1 and PC2 of a gene expression matrix from the CGP cell lines. There is clear clustering of blood, central nervous 
system and lung cancers. For clarity, only these three cancer types are shown here. Additional file 1: Figure SI shows all cancer types, (b) 
Clustering of subtypes of hematological cancers on PCI and PC2 of a gene expression matrix of CGP hematological cancer cell lines. For clarity 
only acute myeloid leukemia, acute lymphoblastic leukemia and B-cell lymphoma are shown here. All data is shown in Additional file 1: Figure S2. 
(c) Clustering of ERBB2 amplified breast cancers on PCI and PC2 of a gene expression matrix of CGP breast cancer cell lines, (d) Clustering of 
BRAF mutated cancers on PCI and PC2 of a gene expression matrix from all CGP cell lines. ALL, acute lymphoblastic leukemia; AML, acute 
myeloid leukemia; CGP, Cancer Genome Project; CNS, central nervous system; CNV, copy number variation; MT, mutated; PC, principal component; 
WT, wild type. 
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Figure 3 Prediction of docetaxel sensitivity in breast cancer patients, (a) Strip chart showing the difference in predicted drug sensitivity for 
individuals sensitive or resistant to docetaxel treatment in vivo, (b) ROC curve showing the proportion of true positives against the proportion of 
false positives as the classification threshold is varied. ROC, receiver operating characteristic. 
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is possible that we (and the original authors) could not 
achieve statistical significance, because of the lack of 
variability in drug response among a small group of pa- 
tients, as cisplatin is not routinely used to treat breast 
cancer [20]. Encouragingly, patients showing a 'complete 
response' or 'progressive disease' had the lowest and high- 
est median predicted drug sensitivity values, respectively 
(Additional file 1: Figure S3); but given that there were 
only three individuals in each of these groups, it is not sur- 
prising that we did not establish significance. Consequentiy, 
a larger clinical cohort may be required to assess rigorously 
whether our models capture variability in cisplatin response 
for triple-negative breast cancer. 

Bortezomib in myeloma 

Next, we applied our approach to a larger publicly avail- 
able clinical phase II/III trial dataset, which assessed re- 
sponse to bortezomib in relapsed multiple myeloma 
patients [19]. In the original study, a pretreatment bone 
marrow aspirate was collected and enriched for tumor 
cells, which underwent microarray expression profiling. 
It was found that 168 patients had a clinically evaluable 
bortezomib response, which was classified as complete 
response (CR), partial response (PR), minimal response 
(MR), no change (NC) or progressive disease (PD) [19] 
using European Group for Bone Marrow Transplantation 
criteria [31]. CR, PR and MR patients were defined as 
responders and NC and PD patients as non-responders. 
Expression in tumor cells was measured using either Affy- 
metrtx Human Genome U133A or U133B arrays in tripli- 
cate. The same samples were interrogated using both A 
and B arrays. Data were processed using the Affymetrix 
MAS5.0 algorithm and the median expression value of the 
replicates that passed quality control was reported. 

This clinical dataset presents some obstacles. Firstly, 
only the preprocessed data is publicly available (GEO ac- 
cession number [GEO:GSE9782]). The fact that the raw 
microarray data are not available is problematic because 
the lack of standardized raw data processing likely 
lowers the performance of our model. Also, clinical sam- 
ples were collected as part of three different clinical tri- 
als from various sites, and they were also hybridized in 
different batches, reflecting the types of issues that may 
be encountered were one to apply such a method in the 
clinic. Furthermore, there is only a single myeloma cell 
line in the training panel. 

LOOCV in the cell line training set revealed similar 
correlations to other drugs (r=0.45; P = 2.6 x 10"^^). 
Despite the suboptimal clinical data, our method cap- 
tured substantial variability in bortezomib response. 
There was a statistically significant difference between 
the predicted drug sensitivity in patients between the 
trial-defined responder and non-responder groups 
(Figure 4a; P=8.9 x 10 * for samples quantified using 



U133A and Additional file 1: Figure S4; P=1.5xl0 
for samples quantified using U133B from i-tests). In the 
U133A dataset, the nine patients who were predicted to 
be most sensitive were all drug responders (Figure 4a). 
The AUCs from ROC curve analysis are 0.63 and 0.71 
for U133A and U133B measurements, respectively 
(Figure 4b; P= 1.3x10"^ and Additional file 1: Figure 
S5; P = 1.0 X 10"^). Strikingly, when the response was 
further subdivided (as CR, PR, MR, NC and PD), the 
median predicted drug sensitivity in each of these five 
groups was in exactly the correct order (Figure 4c and 
Additional file 1: Figure S6) in the U133B samples. 

The authors of the original clinical study reported that 
a 100-gene signature model [32], built on two arms of 
the trial (025 and 040), could predict bortezomib re- 
sponse in the third (039) arm of the trial with 63% ac- 
curacy. To compare our predictions with those originally 
reported, we assessed the performance of our model on 
only this third arm of the trial. Our models generate a 
continuous variable and to compare the results previ- 
ously reported directly, we must dichotomize this vari- 
able (i.e. split the data into 'sensitive' and 'resistant' at an 
arbitrary cut-point). At the optimal cut-point (-5.29), 51 
of 71 patients were correctly classified, meaning that our 
method achieved a classification accuracy of 72%. For a 
large range (-5.57 to -4.53) of possible cut-points, our 
accuracy was greater than the 63% achieved by the trial- 
derived gene signature (Figure 4d). While dichotomizing 
clinical response data is not ideal [33], the original clin- 
ical data is again not available, thus this is the only 
means of directly comparing the predictions. Neverthe- 
less, the results indicate that our models offer a substan- 
tial performance improvement. 

This study also contained a group of 70 patients who 
were treated with dexamethasone (and had a clinically 
evaluable drug response). It was not possible to con- 
struct a dexamethasone specific model as this drug was 
not screened against the CGP cell lines. This group is 
still suitable as a negative control and thus we applied 
the bortezomib model in this cohort. Encouragingly, 
there was no difference in predicted bortezomib sensitiv- 
ity between responders and non-responders to dexa- 
methasone {P = 0.81 from a t-test), suggesting that the 
models applied to bortezomib-treated patients are drug 
specific. 

Eriotinib in non-small cell lung cancer 

Finally, we applied our approach to a dataset from the 
Biomarker-Integrated Approaches of Targeted Therapy for 
Lung Cancer Elimination (BATTLE) study (trial registra- 
tion ID: NCT00409968) [17,34]. A subset of patients 
with recurrent or metastatic non-small cell lung cancer 
(NSCLC) were treated with either eriotinib (« = 25), an 
EGFR inhibitor, or sorafenib (n = 37), a VEGFR inhibitor. 
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Figure 4 Prediction of bortezomib sensitivity in multiple myeloma patients, (a) Strip cliart and boxplot of predicted drug sensitivity for 
in vivo responders and non-responders to bortezomib. (b) ROC curve illustrating estimated prediction accuracy, (c) Strip chart and boxplot with 
responders and non-responders further broken down as showing CR, PR, MR, NC or PD. (d) Strip chart and boxplot illustrating our predictions 
for the (039) arm of the bortezomib trial. The horizontal black line indicates the optimal cut-point, where classification accuracy is 72%. The 
horizontal dashed black lines indicate the range of cut-points for which classification accuracy is >63%, which was the accuracy reported for the 
trial-derived 100-gene signature on this dataset. The horizontal dashed red line is the mean IC50 value in the cell line training set, which could be 
used as an unbiased cut-point. CR, complete response; IVIR, minimal response; NC, no change; PD, progressive disease; PR, partial response. 



in a second-line setting. Raw microarray and drug sensitiv- 
ity data were downloaded from GEO ([GEO:GSE33072]). 

Inspection of the training data revealed that only a 
very small proportion of the cell lines assessed for sensi- 
tivity to these drugs were within the drug screening con- 
centration used by the CGP. This is the case for many 
targeted agents. In contrast, most cell lines treated with 
cytotoxic agents, for example docetaxel, have more ac- 
curately quantified IC50 values, because a much larger 
proportion of cell lines tended to respond within the 
sensitivity screening window. The drastically different re- 
sponse of cell lines to cytotoxic or targeted agents is 
illustrated in Additional file 1: Figure S7. This can be 
rigorously demonstrated by segmenting all drugs into 
two groups (cytotoxic or targeted) and comparing the 
median size of the confidence intervals associated with 
the IC50 values of each drug. Unsurprisingly, the confi- 
dence intervals are larger for targeted agents (average of 
1.9 for cytotoxic compared to 4.5 for targeted agents; 
P = 1.4 X 10" from a Wilcoxon rank sum test). Consist- 
ent with this, the signal-to-noise ratio is also significantly 



different for cytotoxic and targeted drugs (P = 7.1 x 10" 
from a Wilcoxon rank sum test). In light of this, it is not 
reasonable to fit a linear ridge regression for most tar- 
geted agents, because IC50 values for most cell lines were 
derived using extrapolated data, and thus have very large 
associated confidence intervals. An approach that re- 
duces the level of noise fitted in the model will be more 
suited for targeted agent sensitivity prediction. Conse- 
quently, we fitted logistic ridge regression models for 
the 15 most sensitive (which had reliably measured 
IC50 values) versus the 55 most resistant CGP cell lines 
(see Materials and methods). In LOOCV, this method 
provided 89% classification accuracy on the training 
set and separated sensitive and resistant groups with 
P = 9.3 X 10"^, providing additional support for applying 
this approach to clinical samples. 

When applied to the clinical trial data, this modified 
approach captured a large proportion of variability in 
the in vivo erlotinib response (Figure 5a; rho = 0.64 and 
P = 5.3 X 10" from a Spearman's correlation test). All 
patients were EGFR wild-type. Since KRAS mutation is a 
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Probability of Sensitivity Probability of Sensitivity 

Figure 5 Prediction of eriotinib sensitivity in NSCLC patients, (a) Months-to-progression plotted against predicted probability of eriotinib 
sensitivity. All patients are EGFR wild-type. KRA5 mutations are highlighted and a linear regression line is shown, (b) The predicted probability 
of sensitivity to eriotinib plotted against months-to-progression for individuals treated with sorafenib. NSCLC, non-small cell lung cancer. 



known biomarker of resistance to EGFR inhibitors 
[35-37], we evaluated performance in the subset of 20 
individuals who were both EGFR and KRAS wild-type. 
The results remained highly significant (rho = 0.59 and 
P= 6.4x10" from a Spearman's correlation test). This 
means that we enriched for drug responders, even in the 
absence of any known biomarker. A further interesting 
observation is that one patient, who was among the 
most sensitive to eriotinib, had a KRAS mutation, nor- 
mally a biomarker of EGFR inhibitor resistance. Our ap- 
proach predicted this would be the fifth most sensitive 
individual in the cohort, and they were in fact, the fifth 
most sensitive individual. 

The authors of the original study developed a 76-gene 
epithelial-mesenchymal transition (EMT) gene expres- 
sion signature using both NSCLC cell lines and patient 
data. They reasoned, as EMT had been previously shown 
to be associated with EGFR inhibitor resistance, that this 
signature may capture variability in eriotinib response 
in vivo. The gene signature was applied to the 20 EGFR 
and KRAS wild-type NSCLC patients treated with erioti- 
nib. They found that individuals with disease control at 
eight weeks showed a more epithelial-like signature and 
the result was of borderline significance {P = 0.052 from 
a i-test). For comparison with our approach, we assessed 
the difference in predicted probability of eriotinib sensi- 
tivity (from the logistic ridge regression model) between 
individuals with disease progression and those without 
disease progression at two months. In our case, the dif- 
ference was highly statistically significant {P = 4.9 x 10" 
from a i-test). This suggests that whole-genome gene ex- 
pression models, derived from a large panel of cell lines, 
have superior power to predict eriotinib sensitivity, com- 
pared to the 76-gene EMT signature. 

Since the drug sensitivity phenotype evaluated in this 
trial is 'months-to-progressioni it is possible that our 
models are capturing a prognostic phenotype, rather 



than drug sensitivity. To test this, we applied our cell- 
line-derived eriotinib sensitivity prediction model to pre- 
dict months-to-progression for an independent arm of 
the same trial in which NSCLC patients were treated 
with sorafenib. The erlotinib-specific model was not pre- 
dictive of months-to-progression after sorafenib treat- 
ment (Figure 5b; P=0.83 from a f-test), suggesting that 
the model is drug specific, rather than a general pre- 
dictor of disease progression or prognosis. 

Discussion 

We have shown that models constructed using baseline 
gene expression and drug IC50 values, from a large panel 
of cell lines, can predict the chemotherapeutic response 
in patients. Our method uses whole-genome ridge re- 
gression, and the expression of every gene contributes a 
small amount to the prediction. The ridge regression 
penalty parameter is automatically selected, meaning 
that no user input is required to tune the algorithm. Our 
approach captures a statistically significant proportion of 
variability in drug response in three of four clinical trials, 
regardless of the drugs' mechanism of action or the pa- 
tients' cancer type and in all cases, performance was 
comparable to that of the gene signatures derived dir- 
ectly from clinical data. To our knowledge, this is the 
first time that a method capable of this has been 
described. 

It may seem surprising that whole-genome gene ex- 
pression alone has such remarkable power in enriching 
for drug responders, as the approach ignores what are 
thought to be important factors in drug sensitivity, such 
as cancer type, genomic aberrations or any other specific 
markers. We showed that the impressive performance 
may (at least in part) stem from the fact that whole- 
genome gene expression may act as a surrogate for 
unmeasured phenotypes that are directly relevant to 
chemotherapeutic sensitivity (Figure 2), and capture 
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aspects of both germ line and tumor-specific genome 
variation. This observation is further supported by mul- 
tiple studies that have shown that gene expression can 
be used to characterize novel cancer subtypes and has 
been shown to have predictive and prognostic value 
[38-43]. 

Our method attained classification accuracy approaching, 
or even surpassing that of the gene signatures derived dir- 
ectly from clinical trials. In these trials, gene signatures 
were derived using in vivo samples. In the case of docetaxel, 
the signature was generated on the same set of samples on 
which it was tested, which inevitably inflates their estimate 
of prediction accuracy. It would only be possible to com- 
pare performance fairly using an independent dataset. 
Nevertheless, our method significantly enriched for doce- 
taxel responders in the trial dataset. Our approach pre- 
dicted in vivo drug sensitivity more accurately than a 
100-gene signature derived from the bortezomib clinical 
trial. We also outperformed the 76-gene EMT signature 
(generated using both cell lines and patients) in predicting 
sensitivity to erlotinib. There was no evidence that KRAS 
mutation, previously identified as a biomarker of EGFR 
inhibitor resistance, had predictive power in this data, 
although the number of samples was small and this result 
does not discount KRAS mutation as a biomarker for this 
drug. The fact remains that our method also outperformed 
this (already established) biomarker in this dataset. We 
modified the original algorithm (to use logistic instead of 
linear ridge regression) in the analysis of erlotinib data. 
This is justifiable given the severe noise associated with 
the IC50 values for these types of targeted agents in the 
CGP cell lines. Overall, the results suggest that models 
created on very large panels of cell lines, can rival, or even 
surpass the performance of similar in vivo approaches. 

In all studies discussed here, the data are suboptimal, 
because all of the clinical trials used a different micro- 
array platform than was used on the cell line panel train- 
ing set. Also, the cell line panel often only contained a 
very small number of samples from the actual cancer 
type that the clinical trial evaluated. For example, only 
one myeloma cell line was treated with bortezomib in 
our panel of training cell lines. We anticipate that if 
training and test microarray platforms were the same 
and if the cell line panel contained more relevant cancer 
types, accuracy would be further improved. These results 
are congruent with the emerging view that -omics 
characterization of tumors may rival traditional tissue- 
of-origin and pathological descriptors for a variety of 
clinically important classifications. This was supported 
by our finding that, unlike the full cell line panel, the 24 
available breast cancer cell lines could not predict doce- 
taxel response for breast cancer patients (although it 
would be difficult to achieve accurate prediction using 
such a small training set). 



Performance would also likely be enhanced by a more 
detailed assessment of the transcriptome, for example, by 
quantifying transcript expression levels with RNA sequen- 
cing (RNA-seq), which has been shown to provide better 
estimates of expression than microarrays [44]. A recent 
study has also found that results of transcriptome sequen- 
cing using RNA-seq were highly reproducible between dif- 
ferent laboratories, if procedures are standardized (which 
is not generally the case for expression microarrays) [45]. 
This provides further evidence that incorporating RNA- 
seq would increase power and the widespread utility of 
these types of expression-based prediction assays. How- 
ever, a different machine learning algorithm may be better 
suited to the distribution of RNA-seq data. 

We have recently completed a separate study that 
provides additional support for emphasizing transcripto- 
mics in pharmacogenomic prediction. In that analysis, 
we used whole-genome models, similar to the ridge 
regression models used here, to compare the relative 
contribution of whole-genome SNPs, gene expression or 
microRNA (miRNA) expression, to inter-individual vari- 
ability in cellular growth rate [46]. We found that, in 
lymphoblastoid cell lines, far more of the variability in 
growth rate (between cell lines isolated from different 
individuals) can be explained by the transcriptome than 
by genome-wide SNPs. Using gene and miRNA expres- 
sion data, we constructed statistical models that ex- 
plained 48% of variability in growth rate, compared to 
just 2% when using models based on only whole- 
genome SNP data. Given that a substantial proportion 
of chemotherapeutic agents target fast-growing cells 
(for example, docetaxel), these results provide a strong 
rationale for prioritizing the transcriptome when pre- 
dicting clinical response to this class of drugs. The result 
also showed that combining miRNA and gene expres- 
sion data significantly improved prediction (from 38% 
to 48% of variability explained) over mRNA expression 
alone, suggesting that including miRNA and other non- 
coding RNAs may improve the prediction of clinical 
drug sensitivity, although no data is currently available 
that would allow us to test this hypothesis. 

Here, we have demonstrated that models derived from 
a very large panel of cell lines achieve equal or better 
performance for clinical drug sensitivity prediction than 
those derived directly from patients and these findings 
can have a profound impact on patient care. For ex- 
ample, screening for drug sensitivity in vitro is far less 
costly and time-consuming than conducting large clin- 
ical trials. A much larger number of samples can be 
screened against any given drug using cell lines than 
would be feasible (either practically or ethically) in a clin- 
ical setting. Furthermore, in vitro drug sensitivity screen- 
ing is usually conducted under controlled experimental 
conditions to achieve a greater degree of accuracy. The 
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fact that many more samples can be screened, with more 
accuracy, may lead to improved statistical power, com- 
pared to in vivo methods, which inevitably rely on smaller 
sample sizes and noisy clinical response phenotypes. This, 
and the fact that recently developed statistical methods ro- 
bustly correct for intrinsic differences in gene expression 
between cell lines and in vivo tumors, suggest that this 
type of approach is a very promising option for personaliz- 
ing drug treatment. There is no limit to the number of 
drugs that could be screened against a panel of cell lines. 
In theory, every existing chemotherapeutic compound 
could be tested, meaning that given a tumor biopsy, it 
would be trivial to use this approach to estimate sensitivity 
to every drug prior to any course of treatment. The falling 
price of gene expression microarrays makes it feasible to 
incorporate this technology into patient care. The cell line 
training set could also be expanded by including data on 
cellular sensitivity within different microenvironments and 
cell lines under different simulated stromal conditions, as 
it is known that the tumor microenvironment plays a key 
role in tumor development [47]. 

The results also have important implications in drug 
development, where our approach could be used to en- 
rich for likely drug responders, prior to carrying out 
clinical trials. There has been much recent interest in 
developing drugs in conjunction with companion diag- 
nostic tests [48]. The benefits of enriching for likely drug 
responders are obvious, but it is normally difficult to de- 
velop accurate biomarkers without exposing the drug to 
patients. However, our approach, because of its ability to 
enrich for drug responders in a clinical cohort, has enor- 
mous potential as a companion diagnostic. There is also 
a clear ethical benefit, in that such a diagnostic could be 
developed without ever exposing potentially unrespon- 
sive patients to toxic chemotherapeutic agents. 

Finally, we highlight the work of Menden et al., who 
recently constructed models using the CGP cell line 
data, with the aim of predicting in vitro drug sensitivity 
[49]. The authors achieved impressive prediction {R^ = 
0.72 from eightfold cross-validation) by using models 
that consider the effectiveness of drugs with similar 
mechanisms of action. These results are better than 
those achieved using our models (based on expression 
data alone), but currently cannot be extended to in vivo 
data because clinical response to similar drugs is almost 
never known a priori. However, the results suggest that 
in future, in vivo prediction could be improved by con- 
sidering multiple drugs, using either prior information 
on what is known about mechanism of drug action, or 
simply the empirical correlation of drug sensitivities. 

Conclusions 

In summary, we have shown for the first time that it is pos- 
sible to enrich for drug responders in a clinical cohort 



using only baseline tumor gene expression levels, by apply- 
ing models generated from a large panel of cell lines. We 
have also demonstrated that this approach outperforms 
several existing biomarkers in the available clinical datasets. 
These findings have profound implications for personalized 
medicine and drug development. Future work will focus 
on improving predictions using more rigorous transcrip- 
tome quantification and further testing in prospective clin- 
ical trials. The R code [50] needed to reproduce all figures 
and results in this paper, is provided (for academic use) on 
our website in Sweave [51] format (see Data availability). 

Materials and methods 

Bioinformatics analysis overview 

Bioinformatics analyses were performed in R. Our imple- 
mentation is extremely fast (typically running in <30 s on 
a standard desktop computer) and easy to use. Once 
the data are correctly loaded, the user need only pro- 
vide the baseline gene expression and drug sensitivity 
data (i.e. IC50) from the cell line panel and baseline 
gene expression data from the clinical trial. The pre- 
dicted drug sensitivity is then calculated, requiring no 
further user input. All R code is provided in annotated 
Sweave format on our website (see Data availability), 
enabling other investigators to reproduce easily all the 
results and figures presented in this paper. 

Obtaining gene expression and drug sensitivity data 

Drug IC50 values for docetaxel, bortezomib and erlotinib 
were downloaded from the CGP website ([52]; accessed 
August 2013). The raw CGP gene expression microarray 
data (CEL files) were downloaded from ArrayExpress 
under accession number E-MTAB-783. These data were 
preprocessed using the robust multi-array average algo- 
rithm (implemented by the rma() function in the affy [53] 
library in R). This algorithm does background correction, 
quantile normalization and median-polish summarization 
in one step. For summarization, we used the updated pro- 
beset annotation chip definition file (CDF) provided by 
BrainArray (version 17.0.0 for Affymetrix HT Human 
Genome U133A arrays, probesets mapped to Entrez gene 
IDs). We followed the same set of steps to preprocess the 
docetaxel, cisplatin and erlotinib/sorafenib clinical trial 
gene expression data (using the appropriate BrainArray 
CDF file in each case). The bortezomib expression data 
were obtained directly from GEO using the getGEO() 
function implemented in the R library GEOquery [54]. All 
in vivo drug response data were obtained from GEO or 
directly from the relevant publication. 

Combining and homogenizing cell line and clinical trial 
gene expression datasets 

Training (cell lines) and test (clinical trial) datasets were 
mapped to official gene symbols. Probesets that mapped 
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to more than one gene symbol were summarized by 
their mean expression value. In all cases, both datasets 
were generated on different microarray platforms, thus 
we used a subset of genes represented on both plat- 
forms. This typically left approximately 10,000 gene 
symbols remaining. These two datasets were then 
homogenized using the ComBat() function from the sva 
library in R. Finally, we filtered out genes whose expres- 
sion did not vary substantially in the homogenized 
dataset, because if technical variability is greater than 
biological variability, these can never add predictive 
value; we removed the 20% of genes with lowest variabil- 
ity in expression across all samples. 

Predicting in vivo drug sensitivity using linear ridge 
regression for docetaxel, cisplatin and bortezomlb clinical 
trials 

Once the data were prepared as outlined above, a linear 
ridge regression model was fitted for in vitro drug sensi- 
tivity dependent on the homogenized whole-genome ex- 
pression levels in the CGP cell lines (for which both 
drug sensitivity and expression data were available). To 
do this, we used the linearRidge() function from the 
ridge [55] package in R. This function implements a 
method to choose the ridge regression tuning parameter 
automatically. Before fitting the model, the drug sensitiv- 
ity phenotype data (IC50 values) were power transformed 
using the powerTransform() function in the R package 
car. After the model was fitted, it was then applied to 
the homogenized gene expression data from the clinical 
trial, using the predict.linearRidge() function from the 
ridge package in R, thus yielding a drug sensitivity esti- 
mate for each patient. 

Leave-one-out cross-validation 

For LOOCV, all data were first preprocessed and ho- 
mogenized as described above. Then, ridge regression 
models were fitted (as above) on all of the available cell 
line data, but with one sample omitted. Next, these 
models were used to calculate a predicted IC50 value, 
using the gene expression data of the single omitted cell 
line. This process was repeated iteratively for every sam- 
ple thus yielding a predicted IC50 for every cell line. 
These predicted IC50 values were then compared to the 
measured IC50 values, using a Pearson's correlation test, 
giving an estimate of prediction accuracy. 

ElasticNet and Lasso models 

ElasticNet and Lasso regression models were fitted using 
the glmnet package in R. The Lasso penalty parameter 
was selected using the automatic cross-validation feature 
(i.e. the cv.glmnet() function). ElasticNet penalty param- 
eters (alpha and lambda) were selected using the caret 
package in R. Optimal parameters were selected using a 



grid search on the cell line training set, which takes 
approximately one day to run on a standard PC. Param- 
eters were selected based on an optimal value or 
Cohen's kappa (in cross-validation) for linear and logistic 
models, respectively. 

Predicting in vivo drug sensitivity using logistic ridge 
regression for an eriotinib clinical trial 

The data were first prepared as described above. Next, 
we divided the cell line training data into sensitive 
(15 samples) or resistant (55 samples) groups and fitted 
a logistic ridge regression model using the logisticRidge 
0 function from the R package ridge. Again, the ridge 
regression tuning parameter was automatically selected. 
As this implementation of logistic ridge regression is ex- 
tremely computationally intensive, we implemented a 
feature selection step, where only the 1,000 genes that 
were most differentially expressed between the 15 sensi- 
tive and 55 resistant samples were fitted in the model. 
These genes were selected using i-tests, specifically 
using the rowttests() function in the R library genefilter 
[56]. This step enables a standard desktop computer to 
fit a model in approximately ten minutes (as opposed to 
days). Once the model is fitted, it is applied to the ho- 
mogenized gene expression data from the clinical trial, 
using the predict.logisticRidge() function, which calcu- 
lates the predicted log-odds of drug sensitivity. 

Statistical analysis of results 

ROC curve analysis was performed using the ROCR [28] 
package in R. Empirical P-values were generated using 
100,000 sample label permutations and computing the 
proportion of permutations for which the AUC was 
more extreme than that observed in the original data. 
Linear regression, f-tests and Spearman's correlation 
tests were performed using the base functions in R. 
Figure 1 was generated using the Inkscape software and 
the remaining figures were generated in R. 

Data availability 

Annotated R code (in Sweave format) to reproduce all of 
the analysis in this paper is available from our website 
[57]. The CGP gene expression data are available from 
ArrayExpress under accession number E-MTAB-783. The 
IC50 data for the drugs is available from the CGP website 
[52] . The docetaxel data are available from GEO under ac- 
cession numbers [GEO:GSE349] and [GEO:GSE350]. The 
cisplatin data are available from ArrayExpress under ac- 
cession number E-GEOD-18864. The bortezomib data are 
available from GEO under accession number [GEO: 
GSE9782]. The eriotinib data are available from GEO 
under accession number [GEO:GSE33072]. Complete de- 
tails and R code showing how to acquire and preprocess 
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all of these data, as well as the associated clinical data are 
available in Sweave format on our website. 

Additional file 



Additional file 1: PDF file containing all supplementary figures, 
tables and their associated legends. 
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